Tachanun KANGWANTRAKOOL Kobkrit VIRIYAYUDHAKORN Thanaruk THEERAMUNKONG
Most existing methods of effort estimation in software development are manual, labor-intensive and subjective, resulting in overestimation that causes failed bids and underestimation that causes financial loss. This paper investigates the effectiveness of sequence models for estimating development effort, in the form of man-months, from software project data. Four architectures are compared in terms of man-month difference: (1) average word-vector with Multi-Layer Perceptron (MLP), (2) average word-vector with Support Vector Regression (SVR), (3) Gated Recurrent Unit (GRU) sequence model, and (4) Long Short-Term Memory (LSTM) sequence model. The approach is evaluated on two datasets: ISEM (1,573 English software project descriptions) and ISBSG (9,100 software project records), where the former is raw text and the latter is a structured data table describing the characteristics of a software project. The LSTM sequence model achieves the lowest and the second-lowest mean absolute errors, 0.705 and 14.077 man-months on the ISEM and ISBSG datasets respectively, while the MLP model achieves the lowest mean absolute error of 14.069 man-months on the ISBSG dataset.
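As a rough illustration of the simplest baseline above, the sketch below averages pre-trained word vectors over a project description and regresses man-months with SVR. The embedding table, feature dimension, and training data are placeholders, not the setup used in the paper.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
vocab = ["build", "a", "payroll", "web", "application", "migrate", "legacy", "database"]
embeddings = {w: rng.normal(size=50) for w in vocab}   # stand-in for pre-trained word vectors

def average_word_vector(text, emb, dim=50):
    """Mean of the word vectors found in `emb`; zeros if no word matches."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Hypothetical project descriptions paired with effort in man-months.
descriptions = ["Build a payroll web application", "Migrate a legacy database"]
efforts = np.array([12.0, 4.5])

X = np.stack([average_word_vector(d, embeddings) for d in descriptions])
model = SVR(kernel="rbf", C=10.0).fit(X, efforts)
print(model.predict(X))        # estimated man-months for the training descriptions
```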
Yoonhee KIM Deokgyu YUN Hannah LEE Seung Ho CHOI
This paper presents a deep learning-based non-intrusive speech intelligibility estimation method using bottleneck features of an autoencoder. The conventional standard non-intrusive speech intelligibility estimation method, P.563, lacks estimation performance in various noise environments. We propose a more accurate speech intelligibility estimation method based on a long short-term memory (LSTM) neural network whose input and output are autoencoder bottleneck features and a short-time objective intelligibility (STOI) score, respectively, where STOI is a standard tool for measuring intrusive speech intelligibility with reference speech signals. We show that the proposed method outperforms both the conventional standard P.563 and mel-frequency cepstral coefficient (MFCC) feature-based intelligibility estimation methods for speech signals in various noise environments.
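A minimal sketch of the regression step described above, assuming the autoencoder bottleneck features have already been extracted: an LSTM consumes a sequence of bottleneck frames and a linear head predicts a single STOI-like score. Layer sizes and data are illustrative only, not the paper's configuration.

```python
import torch
import torch.nn as nn

class IntelligibilityEstimator(nn.Module):
    """LSTM regressor: bottleneck feature sequence -> scalar STOI estimate."""
    def __init__(self, feat_dim=32, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        _, (h, _) = self.lstm(x)          # h[-1]: last hidden state per utterance
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)  # STOI roughly in [0, 1]

model = IntelligibilityEstimator()
bottleneck = torch.randn(4, 200, 32)      # 4 utterances, 200 frames each (dummy data)
stoi_target = torch.rand(4)               # reference STOI scores (dummy)
loss = nn.functional.mse_loss(model(bottleneck), stoi_target)
loss.backward()
```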
This letter presents a novel technique to achieve fast inference of binarized convolutional neural networks (BCNN). The proposed technique modifies the structure of the constituent blocks of the BCNN model so that the input elements of the max-pooling operation are binary. In this structure, if any of the input elements is +1, the result of the pooling can be produced immediately; the proposed technique eliminates the computations involved in obtaining the remaining input elements, thereby reducing the inference time effectively. The proposed technique reduces the inference time by up to 34.11% while maintaining the classification accuracy.
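The early-exit idea can be illustrated with plain Python: because every pooling input is either +1 or -1, the maximum is known to be +1 as soon as one +1 is seen, so the remaining elements (and the computations that would produce them) can be skipped. The window iteration below is a simplified stand-in for the paper's block structure.

```python
def binary_max_pool_window(compute_element, window_indices):
    """Early-exit max over a pooling window whose elements are in {-1, +1}.

    `compute_element(i)` lazily produces the i-th binary input; it is only
    called until the first +1 appears, so later computations are skipped.
    """
    for i in window_indices:
        if compute_element(i) == +1:
            return +1            # max is already determined; stop early
    return -1                    # all elements were -1

# Toy example: pretend each element needs an expensive binary convolution.
values = [-1, +1, -1, -1]
calls = []
def expensive_binary_conv(i):
    calls.append(i)              # record how many elements were actually computed
    return values[i]

result = binary_max_pool_window(expensive_binary_conv, range(4))
print(result, len(calls))        # -> 1 2  (only two of four elements computed)
```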
Handwritten numeral recognition is a classical and important task in the computer vision area. We propose two novel deep learning models for this task, which combine an edge extraction method with Siamese/Triplet network structures. We evaluate the models on seven handwritten numeral datasets, and the results demonstrate both the simplicity and the effectiveness of our models compared to baseline methods.
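As a hedged illustration of how edge maps and a Siamese structure can be combined (the concrete models in the paper are not reproduced here), the sketch below extracts Sobel edges from two digit images, embeds both with a shared small CNN, and scores their similarity by embedding distance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fixed Sobel kernels used as a simple edge extractor (one possible choice).
SOBEL = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]],
                      [[[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]]])

def edge_map(img):                      # img: (batch, 1, H, W)
    g = F.conv2d(img, SOBEL, padding=1) # horizontal and vertical gradients
    return g.pow(2).sum(dim=1, keepdim=True).sqrt()

class Encoder(nn.Module):               # shared branch of the Siamese network
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                 nn.Linear(8 * 16, 32))
    def forward(self, x):
        return self.net(x)

enc = Encoder()
a, b = torch.rand(2, 1, 28, 28), torch.rand(2, 1, 28, 28)       # dummy digit pairs
dist = F.pairwise_distance(enc(edge_map(a)), enc(edge_map(b)))  # small = same class
```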
Wenli ZHU Min ZHANG Chenxi WU Lingqing ZENG
A convolutional neural network (CNN) for broadband direction-of-arrival (DOA) estimation of far-field electromagnetic signals is presented. The proposed algorithm performs a nonlinear inverse mapping from the received signal to the angle of arrival. The signal model used in the algorithm is based on a circular antenna array geometry, and the phase component extracted from the spatial covariance matrix is used as the input of the CNN. A CNN model comprising three convolutional layers is then established to approximate the nonlinear mapping. The performance of the CNN model is evaluated in a noisy environment for various values of signal-to-noise ratio (SNR). The results demonstrate that the proposed CNN model, with the phase component of the spatial covariance matrix as its input, achieves fast and accurate broadband DOA estimation and retains strong performance even at low SNR values.
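A small numpy sketch of the input construction described above, under simplified assumptions (a uniform circular array, a single narrowband source, illustrative element and snapshot counts): the sample spatial covariance matrix is formed from array snapshots and its element-wise phase is taken as the CNN input.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 256                    # 8-element circular array, 256 snapshots (illustrative)
radius, wavelength = 0.5, 1.0
theta_true = np.deg2rad(40.0)    # hypothetical angle of arrival

# Steering vector of a uniform circular array for azimuth angle theta.
phi = 2 * np.pi * np.arange(M) / M
a = np.exp(1j * 2 * np.pi * radius / wavelength * np.cos(theta_true - phi))

s = rng.normal(size=N) + 1j * rng.normal(size=N)          # source signal
noise = 0.1 * (rng.normal(size=(M, N)) + 1j * rng.normal(size=(M, N)))
X = np.outer(a, s) + noise                                # received snapshots (M x N)

R = X @ X.conj().T / N                                    # sample spatial covariance
cnn_input = np.angle(R)                                   # phase component fed to the CNN
print(cnn_input.shape)                                    # (8, 8) real-valued map
```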
Music classification has been inspired by the remarkable success of deep learning. To enhance efficiency and ensure high performance at the same time, a hybrid architecture that combines deep learning and Broad Learning (BL) is proposed for music classification tasks. At the feature extraction stage, a Random CNN (RCNN) is adopted to analyze the Mel-spectrogram of the input music sound. Compared with a conventional CNN, the RCNN has a more flexible structure that adapts to the variance contained in different types of music. At the prediction stage, the BL technique is introduced to enhance prediction accuracy and reduce training time. Experimental results on three benchmark datasets (GTZAN, Ballroom, and Emotion) demonstrate that: i) the proposed scheme achieves higher classification accuracy than the deep learning baseline combining CNN and LSTM on all three benchmark datasets; ii) both RCNN and BL contribute to the performance improvement of the proposed scheme; and iii) the introduction of BL also enhances the prediction efficiency of the proposed scheme.
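To give a flavor of the Broad Learning prediction stage (the exact configuration in the paper is not shown here), the numpy sketch below builds random feature and enhancement nodes from extracted features and solves the output weights in closed form with a ridge-regularized pseudo-inverse, which is why training is fast.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # hypothetical RCNN features: 200 clips x 64 dims
y = rng.integers(0, 10, size=200)       # 10 music genres (dummy labels)
Y = np.eye(10)[y]                       # one-hot targets

# Feature nodes and enhancement nodes with fixed random weights.
Wf = rng.normal(size=(64, 100))
Z = np.tanh(X @ Wf)                     # feature nodes
We = rng.normal(size=(100, 300))
H = np.tanh(Z @ We)                     # enhancement nodes
A = np.hstack([Z, H])                   # broad expansion of the input features

lam = 1e-2                              # ridge regularization
W_out = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
pred = np.argmax(A @ W_out, axis=1)     # class prediction
```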
Rui CHEN Ying TONG Ruiyu LIANG
Deep neural networks have achieved great success in visual tracking by learning generic representations and leveraging large amounts of training data to improve performance. However, most generic object trackers are trained from scratch online and do not benefit from the large number of videos available for offline training. We present a real-time generic object tracker that incorporates temporal information into its model, learns from many examples offline, and updates quickly online. During training, the pre-trained weights of the convolutional layers are updated with a lag, and the length of the input video sequences is gradually increased for fast convergence. Furthermore, only the hidden states of the recurrent network are updated online to guarantee real-time tracking speed. The experimental results show that the proposed method tracks objects at 150 fps with a higher predicted overlap rate and is more robust on multiple benchmarks than state-of-the-art trackers.
In this letter, the performance of a state-of-the-art deep learning (DL) algorithm in [5] is analyzed and evaluated for orthogonal frequency-division multiplexing (OFDM) receivers, in the presence of harmonic spur interference. Moreover, a novel spur cancellation receiver structure and algorithm are proposed to enhance the traditional OFDM receivers, and serve as a performance benchmark for the DL algorithm. It is found that the DL algorithm outperforms the traditional algorithm and is much more robust to spur carrier frequency offset.
Yuya SENZAKI Satsuya OHATA Kanta MATSUURA
Research on adversarial examples for machine learning has received much attention in recent years. Most previous approaches are white-box attacks; that is, the attacker needs to obtain the internal parameters of a target classifier beforehand to generate adversarial examples for it. This condition is hard to satisfy in practice. There is also research on black-box attacks, in which the attacker can only obtain partial information about target classifiers; however, it seems these attacks can be prevented, since they need to issue many suspicious queries to the target classifier. In this paper, we show that a naive defense strategy based on surveillance of the number of queries does not suffice. More concretely, we propose to generate not pixel-wise but block-wise adversarial perturbations to reduce the number of queries. Our experiments show that such coarse perturbations can confuse the target classifier, and we succeed in reducing the number of queries needed to generate adversarial examples in most cases. Our simple method is an untargeted attack and may have a lower success rate than previous black-box attacks, but it needs fewer queries on average. Surprisingly, the minimum number of queries (one and three on the MNIST and CIFAR-10 datasets, respectively) is enough to generate adversarial examples in some cases. Moreover, based on these results, we propose a detailed classification of black-box attackers and discuss countermeasures against the above attacks.
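The block-wise idea can be sketched as follows: instead of perturbing individual pixels, whole blocks are shifted together, so each query to the black-box classifier tests a much coarser perturbation. The classifier below is a placeholder function, and the block size, step budget, and perturbation magnitude are illustrative, not the paper's settings.

```python
import numpy as np

def blockwise_attack(image, query_label, block=7, eps=0.3, max_queries=50, rng=None):
    """Untargeted black-box attack with random block-wise perturbations (sketch).

    `query_label(img)` stands in for one query to the target classifier and
    returns its predicted label; the attack stops once the label changes.
    """
    rng = rng or np.random.default_rng(0)
    original = query_label(image)
    h, w = image.shape
    for q in range(1, max_queries + 1):
        perturbed = image.copy()
        # Perturb each block with a single random sign (coarse, not pixel-wise, noise).
        for i in range(0, h, block):
            for j in range(0, w, block):
                perturbed[i:i + block, j:j + block] += eps * rng.choice([-1.0, 1.0])
        perturbed = np.clip(perturbed, 0.0, 1.0)
        if query_label(perturbed) != original:
            return perturbed, q          # adversarial example found after q queries
    return None, max_queries

# Usage with a dummy "classifier" that thresholds mean intensity.
dummy = lambda img: int(img.mean() > 0.5)
adv, queries = blockwise_attack(np.full((28, 28), 0.45), dummy)
```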
Andros TJANDRA Sakriani SAKTI Satoshi NAKAMURA
Recurrent Neural Networks (RNNs) have achieved many state-of-the-art results on complex tasks involving temporal and sequential data, but most of these RNNs require substantial computational power and a huge number of parameters for both the training and inference stages. We apply several tensor decomposition methods, namely CANDECOMP/PARAFAC (CP), Tucker decomposition and Tensor Train (TT), to re-parameterize the Gated Recurrent Unit (GRU) RNN. First, we evaluate all tensor-based RNNs on sequence modeling tasks with various numbers of parameters. Based on our experimental results, TT-GRU achieves the best results across various numbers of parameters compared to the other decomposition methods. We then evaluate the proposed TT-GRU on a speech recognition task by compressing the bidirectional GRU layers inside the DeepSpeech2 architecture. The results show that the proposed TT-format GRU preserves performance while significantly reducing the number of GRU parameters compared to the uncompressed GRU.
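As an illustration of the parameter-saving mechanism (not the paper's exact layer), the sketch below factorizes a dense weight matrix, viewed as a higher-order tensor, into Tensor-Train cores with truncated SVDs and compares parameter counts; the ranks and reshaping are arbitrary choices.

```python
import numpy as np

def tt_decompose(tensor, max_rank):
    """TT-SVD: factor a d-way tensor into a list of 3-way cores (sketch)."""
    shape = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(r_prev * shape[0], -1)
    for k in range(len(shape) - 1):
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(S))
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        mat = (np.diag(S[:r]) @ Vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))
    return cores

# A 256x1024 GRU-style weight matrix reshaped into a 4-way tensor (illustrative).
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 1024))
cores = tt_decompose(W.reshape(16, 16, 32, 32), max_rank=4)
tt_params = sum(c.size for c in cores)
print(W.size, tt_params)        # dense vs. TT parameter counts
```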
Feng LIU Xuecheng HE Conggai LI Yanli XU
For frequency-division-duplex (FDD) massive multiple-input multiple-output (MIMO) systems, channel state information (CSI) feedback plays a critical role. Although deep learning has been used to compress the CSI feedback, some issues, such as truncation and channel noise, still need further investigation. To address these practical concerns, we propose an improved model (called CsiNet-Plus) that includes a truncation process and a channel noise process. Simulation results demonstrate that CsiNet-Plus outperforms the existing CsiNet. The performance interchangeability between the number of truncated decimal digits and the signal-to-noise ratio helps support flexible configuration.
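The two extra processes can be mimicked with a few lines of numpy: the encoder's codeword is truncated to a fixed number of decimal digits before feedback, and additive white Gaussian noise models the feedback channel. The codeword itself is random here, standing in for the output of a CsiNet-style encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
codeword = rng.normal(size=32)                 # stand-in for the encoder's CSI codeword

def truncate(x, digits):
    """Keep only `digits` decimal digits, as in the truncation process."""
    scale = 10.0 ** digits
    return np.trunc(x * scale) / scale

def awgn(x, snr_db, rng):
    """Pass the (truncated) codeword through an AWGN feedback channel."""
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(scale=np.sqrt(p_noise), size=x.shape)

received = awgn(truncate(codeword, digits=2), snr_db=10.0, rng=rng)
# The decoder (not shown) would reconstruct the downlink CSI from `received`.
```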
Xingyu ZHANG Xia ZOU Meng SUN Penglong WU Yimin WANG Jun HE
In order to improve the noise robustness of automatic speaker recognition, many speech/feature enhancement techniques based on deep neural networks (DNNs) have been explored. In this work, a DNN multi-level enhancement (DNN-ME) scheme, which consists of signal enhancement, cepstrum enhancement and i-vector enhancement stages, is proposed for text-independent speaker recognition. Given that these enhancement methods are applied at different stages of the speaker recognition pipeline, it is worth exploring their complementary roles, which benefits the understanding of the pros and cons of enhancement at each stage. To exploit the capabilities of DNN-ME as much as possible, two methods, cascaded DNN-ME and joint input of DNNs, are studied. Weighted Gaussian mixture models (WGMMs), proposed in our previous work, are also applied to further improve the model's performance. Experiments conducted on the Speakers in the Wild (SITW) database show that DNN-ME is significantly superior to systems with only a single enhancement for noise-robust speaker recognition. Compared with the i-vector baseline, the equal error rate (EER) was reduced from 5.75 to 4.01.
Jinna LV Bin WU Yunlei ZHANG Yunpeng XIAO
Recently, social relation analysis has received an increasing amount of attention, moving from text to image data. However, social relation analysis from video, an important problem, is still lacking in the current literature. Two challenges remain: 1) it is hard to learn a satisfactory mapping function from low-level pixels to the high-level social relation space; 2) it is unclear how to efficiently select the most relevant information from noisy and unsegmented video. In this paper, we present an Attentive Sequences Recurrent Network model, called ASRN, to deal with these challenges. First, in order to explore multiple clues, we design a Multiple Feature Attention (MFA) mechanism to fuse multiple visual features (i.e., image, motion, body, and face). In this manner, we can generate an appropriate mapping function from low-level video pixels to the high-level social relation space. Second, we design a sequence recurrent network based on a Global and Local Attention (GLA) mechanism. Specifically, the attention mechanism in GLA integrates the global feature with local sequence features to select the sequences most relevant to the recognition task, so the GLA module can better deal with noisy and unsegmented video. Finally, extensive experiments on the SRIV dataset demonstrate the performance of our ASRN model.
Umme Aymun SIDDIQUA Abu Nowshed CHY Masaki AONO
Stance detection in Twitter aims at mining user stances expressed in a tweet towards one or multiple target entities. Detecting and analyzing user stances from massive opinion-oriented Twitter posts provides enormous opportunities to journalists, governments, companies, and other organizations. Most prior studies have explored traditional deep learning models, e.g., long short-term memory (LSTM) and gated recurrent units (GRU), for detecting stance in tweets. However, compared to these traditional approaches, the recently proposed densely connected bidirectional LSTM and nested LSTM architectures effectively address the vanishing-gradient and overfitting problems as well as long-term dependencies. In this paper, we propose a neural network model that adopts the strengths of these two LSTM variants to learn better long-term dependencies, where each module is coupled with an attention mechanism that amplifies the contribution of important elements to the final representation. We also employ a multi-kernel convolution on top of them to extract higher-level tweet representations. Results of extensive experiments on single- and multi-target benchmark stance detection datasets show that our proposed method achieves a substantial improvement over the current state-of-the-art deep learning based methods.
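The multi-kernel convolution step can be sketched independently of the rest of the architecture: parallel 1-D convolutions with different kernel widths slide over the token representations, and their max-pooled outputs are concatenated. The dimensions below are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class MultiKernelConv(nn.Module):
    """Parallel 1-D convolutions with several kernel sizes over a token sequence."""
    def __init__(self, in_dim=256, n_filters=64, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, n_filters, k, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):                      # x: (batch, seq_len, in_dim)
        x = x.transpose(1, 2)                  # Conv1d expects (batch, in_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)        # (batch, n_filters * len(kernel_sizes))

features = MultiKernelConv()(torch.randn(8, 30, 256))   # 8 tweets, 30 tokens each
print(features.shape)                                    # torch.Size([8, 192])
```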
Kyungdeuk KO Jaihyun PARK David K. HAN Hanseok KO
In-class species classification based on animal sounds is a highly challenging task, even with the latest deep learning techniques. The difficulty of distinguishing species is further compounded when the number of species within the same class is large. This paper presents a novel approach for fine categorization of animal species based on their sounds by using pre-trained CNNs and a new self-attention module well suited to acoustic signals. The proposed method is shown to be effective: it achieves an average species accuracy of 98.37% and a minimum species accuracy of 94.38%, the highest among the competing baselines, which include CNNs without self-attention and CNNs with CBAM, FAM, and CFAM but without pre-training.
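The sketch below shows one generic form of self-attention over CNN feature maps of a spectrogram (the paper's specific module is not reproduced): queries, keys and values are 1x1 projections of the feature map, and the attention weights re-mix spatial positions before the classifier.

```python
import torch
import torch.nn as nn

class SpectrogramSelfAttention(nn.Module):
    """Generic non-local self-attention over (C, H, W) acoustic feature maps."""
    def __init__(self, channels=64):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                         # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)  # (B, HW, C//8)
        k = self.k(x).flatten(2)                  # (B, C//8, HW)
        v = self.v(x).flatten(2)                  # (B, C, HW)
        attn = torch.softmax(q @ k / (C // 8) ** 0.5, dim=-1)   # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).reshape(B, C, H, W)
        return x + out                            # residual connection

feat = torch.randn(2, 64, 16, 32)                 # CNN features of two spectrograms
refined = SpectrogramSelfAttention()(feat)
```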
The aim of this paper is to propose effective attentional regions for fine-grained visual recognition. Building on the Spatial Transformers' capability for spatial manipulation within networks, we propose an extension model, the Attention-Guided Spatial Transformer Networks (AG-STNs). This model first guides the Spatial Transformers with hard-coded attentional regions. The guidance can then be turned off, and the network adjusts the learned regions in terms of location and scale. This adjustment is conditioned on the classification loss, so it is optimized directly for better recognition results. With this model, we are able to successfully capture detailed attentional information. The AG-STNs also capture attentional information at multiple levels, and in our experiments the different levels of attentional information are complementary to each other; fusing them brings better results.
Xiao-Yi ZHAO Chao-Yi DONG Peng ZHOU Mei-Jia ZHU Jing-Wen REN Xiao-Yan CHEN
This paper employed AlexNet, a deep convolutional neural network, to automatically diagnose damage on wind power generator blade surfaces. The original images of the blade surfaces were captured by the machine vision system of a 4-rotor UAV (unmanned aerial vehicle). Firstly, an 8-layer AlexNet, comprising 21 functional sub-layers in total, is constructed and parameterized. Secondly, the AlexNet is trained with 10,000 images and then tested in 6 rounds of 350 images each. Finally, the test statistics show that the average accuracy of damage diagnosis by AlexNet is about 99.001%. We also trained and tested a traditional BP (back-propagation) neural network, which has a 20-neuron input layer, a 5-neuron hidden layer, and a 1-neuron output layer, with the same image data. The average diagnosis accuracy of the BP neural network is 19.424% lower than that of AlexNet. This shows that it is feasible to combine UAV image acquisition with a deep learning classifier to automatically diagnose damage on in-service wind turbine blades.
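A minimal sketch of how an off-the-shelf AlexNet could be adapted to a damaged/undamaged blade classification problem, assuming torchvision is available; the pre-trained initialization, the two-class head, and the dummy data are assumptions for illustration, not the paper's 21 sub-layer parameterization or training setup.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pre-trained AlexNet and replace the final classifier layer.
net = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
net.classifier[6] = nn.Linear(4096, 2)          # 2 classes: damaged / undamaged (assumed)

optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One dummy training step on random stand-ins for UAV blade images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
loss = criterion(net(images), labels)
loss.backward()
optimizer.step()
```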
Zhuo ZHANG Yan LEI Jianjun XU Xiaoguang MAO Xi CHANG
Existing fault localization approaches based on neural networks use the information of whether a statement is executed or not to identify suspicious statements potentially responsible for a failure. However, this information only reflects the binary execution state of a statement and cannot show how important the statement is in the executions; consequently, it may degrade fault localization effectiveness. To address this issue, this paper proposes TFIDF-FL, which uses term frequency-inverse document frequency to quantify the degree of influence of a statement in an execution. Our empirical results on 8 real-world programs show that TFIDF-FL significantly improves fault localization effectiveness.
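To make the weighting concrete, the sketch below treats each test execution as a "document" and each executed statement as a "term", computing a tf-idf style weight in place of the usual binary coverage value. The coverage data are invented, and the exact formulation used by TFIDF-FL and its integration with a neural network are not reproduced here.

```python
import math
from collections import Counter

# Hypothetical coverage data: each test lists the statements it executed
# (repetition counts stand in for execution frequency within the test).
executions = [
    ["s1", "s2", "s3", "s3", "s5"],      # test 1
    ["s1", "s3", "s4"],                  # test 2
    ["s1", "s2", "s2", "s5"],            # test 3
]

n_tests = len(executions)
doc_freq = Counter(stmt for trace in executions for stmt in set(trace))

def tfidf_weights(trace):
    """Weight of each statement in one execution: tf * idf instead of 0/1 coverage."""
    tf = Counter(trace)
    total = len(trace)
    return {s: (tf[s] / total) * math.log(n_tests / doc_freq[s]) for s in tf}

for i, trace in enumerate(executions, 1):
    print(f"test {i}:", tfidf_weights(trace))
```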
Soo-Yeon SHIN Dong-Myung KIM Jae-Won SUH
Due to improvements in hardware and software performance, deep learning algorithms have been used in many areas and have shown good results. In this paper, we propose a noise reduction framework based on a convolutional neural network (CNN) with deconvolution and a modified residual network (ResNet) to remove image noise. Simulation results show that the proposed algorithm is superior to the conventional noise eliminator in subjective and objective performance analyses.
To minimize the annotation costs associated with training semantic segmentation and object detection models, weakly supervised detection and weakly supervised segmentation approaches have been studied extensively. However, most of these approaches assume that the training and testing domains are the same, which at times results in considerable performance drops. For example, if we train an object detection network using only web images showing a large object at the center, it can be difficult for the network to detect multiple small objects. In this paper, we focus on training a CNN with only web images and achieving object detection in the wild. A proposal-based approach can address the problem of domain difference because web images are similar to proposal regions: in both domains, the target object is located at the center of the image and the ratio of the size of the target object to the size of the image is large. Several proposal methods have been proposed to detect regions with high “object-ness”; however, many of them generate a large number of candidates to increase the recall rate. Considering the recent advent of deep CNNs, methods that generate a large number of proposals are problematic in terms of processing time for practical use. Therefore, we propose a CNN-based “food-ness” proposal method that requires neither pixel-wise annotation nor bounding box annotation. Our method generates proposals through backpropagation, most of these proposals focus only on food objects, and the number of proposals can be controlled easily. In our experiments, we trained a network model using only web images and tested it on the UEC FOOD 100 dataset, demonstrating that the proposed method achieves high performance compared to traditional proposal methods in terms of the trade-off between accuracy and computational cost. The proposed approach is thus intermediate between the traditional proposal approach and the fully convolutional approach: a novel proposal method that generates high “food-ness” regions using fully convolutional networks based on the backward approach, trained on food images gathered from the web.
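The core mechanism, proposals obtained through backpropagation, can be illustrated generically: the gradient of a class score with respect to the input image highlights class-relevant pixels, and thresholding this saliency map yields a candidate box. This is a generic gradient-saliency sketch, not the paper's food-ness network; the pre-trained classifier and the target class index are placeholders.

```python
import torch
from torchvision import models

# Placeholder classifier; the paper's food-ness network is not reproduced here.
net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in for a web image
target_class = 123                                        # hypothetical class index

score = net(image)[0, target_class]
score.backward()                                          # backward pass to the input

saliency = image.grad.abs().max(dim=1).values[0]          # (224, 224) saliency map
mask = saliency > saliency.mean() + saliency.std()        # keep strongly activated pixels
ys, xs = torch.nonzero(mask, as_tuple=True)
if len(xs) > 0:
    proposal = (xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())
    print("proposal box (x1, y1, x2, y2):", proposal)
```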